The Allotrope Data Format (ADF) [[!ADF]] consists of several APIs and taxonomies. This document constitutes the specification for the ADF Data Package (ADF-DP). It defines how to store multiple data files such as audio, video, images or text in a single package. The purpose of packaging is to ensure consistency and integrity of data files and meta-data during storage and transfer. Files stored in the data package can represent measurements or results of an experiment or process described in the data description.
THESE MATERIALS ARE PROVIDED "AS IS" AND ALLOTROPE EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE WARRANTIES OF NON-INFRINGEMENT, TITLE, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
This document is part of a set of specifications on the Allotrope Data Format [[!ADF]].
The Allotrope Data Format defines an interface for experimental data generated in analytical laboratory processes. It is intended for data exchange, long-term preservation and fast real-time data access. The ADF Data Package API defines an interface for storing files and folder structures and thus provides one of the most essential operations of the ADF. A manual for using the ADF Data Package API is given in [[!ADF-DP-DG]].
The document is structured as follows: First, the role of the ADF Data Package API within the high level structure is explained. Then the functional as well as non-functional requirements for the ADF Data Package API are listed and, based on these, the ADF Data Package architecture is described in detail. Finally, an overview of the ADF-DP ontology and a detailed description of the operations on files and folders are given.
Within this specification, the following namespace prefix bindings are used:
Prefix | Namespace |
---|---|
owl: | http://www.w3.org/2002/07/owl# |
rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfs: | http://www.w3.org/2000/01/rdf-schema# |
xsd: | http://www.w3.org/2001/XMLSchema# |
skos: | http://www.w3.org/2004/02/skos/core# |
dct: | http://purl.org/dc/terms/ |
foaf: | http://xmlns.com/foaf/0.1/ |
adf-dp: | http://purl.allotrope.org/ontologies/datapackage# |
pav: | http://purl.org/pav/ |
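
As an illustration of how these bindings are read, the following minimal Python sketch expands a prefixed name such as dct:title into a full IRI. The names PREFIXES and expand are hypothetical helpers introduced for this example only; they are not part of the ADF Data Package API.

```python
# Prefix bindings copied from the table above.
PREFIXES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "adf-dp": "http://purl.allotrope.org/ontologies/datapackage#",
    "pav": "http://purl.org/pav/",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name (hypothetical helper) into a full IRI."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"unknown prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local_name
```

For example, expand("adf-dp:fileSize") yields the namespace IRI of adf-dp followed by fileSize.
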
Within this document, decimal numbers will use a dot "." as the decimal mark.
Within this document the definitions of MUST, MUST NOT, SHOULD and MAY are used as defined in [[!rfc2119]].
The following figure illustrates the high level structure of the Allotrope Data Format (ADF) [[!ADF]] API stack:
This document focuses on the Data Package API. The ADF Data Package Ontology is described in [[!ADF-DPO]].
This section first introduces the key requirements for the Allotrope Data Package (ADF-DP) and then describes these in detail.
The key requirements for ADF-DP are:
This section describes the functional and non-functional requirements in detail.
The ADF Data Package API has the following functional requirements:
The ADF Data Package API has the following non-functional requirements:
The ADF Data Package API has the following interoperability requirements:
This section describes the architecture of ADF-DP.
The following figure illustrates the high level architecture of the Allotrope Data Format (ADF):
Given an ADF file, the Data Package is located in the top-level folder 'data-package'. The Data Package MAY contain files and folders. Meta data of these files and folders are stored in the ADF Triple Store (ADF-QS) via the ADF Data Description (ADF-DD) API.
This section describes the mapping of the Allotrope Data Package to [[!HDF5]].
In HDF5, the Allotrope Data Package MUST be represented as follows:
Characteristics of the Mapping:
The implementation of the DpFile Revisions is based on the data written by the AuditTrail. For performance reasons it duplicates all data needed for the revision feature to the AuditTrail node for the DpFile (IRI of the DpFile).
This feature has been designed to work with files written by older versions of the ADF library that do not support revisions. An ADF file may therefore be written to by different versions of the library, some supporting revisions and some not, without breaking the feature: after a write by a version that supports revisions, the full revision history of the written DpFile is available.
The standard update mechanism is linked to the AuditRecordClosingEvent, which triggers an update of the revision history of all DpFiles changed in the AuditRecord. In addition, to support ADF files that might have been modified by an older version of the ADF library, on every access to a file's revision history the last entry is checked against the AuditTrail of the DpFile. If the two entries match, no further action is taken.
If the AuditEntry is younger than the last revision entry all data needed for a new revision entry will be stored in a list and the next older AuditEntry will be checked. This continues until the AuditEntry has been found that matches the original last revision entry. Now the new revision entries can be created and numbered continuously.
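
The catch-up procedure described above can be sketched as follows. This is a simplified model, not the ADF implementation: revision entries are plain dictionaries, audit entries are opaque identifiers ordered oldest first, and each revision is assumed to remember the audit entry it was derived from. The function name catch_up and the dictionary keys are hypothetical.

```python
def catch_up(revisions, audit_entries):
    """Return the revision entries that are missing from `revisions`.

    revisions     -- existing revision entries, oldest first; each is a dict
                     with the keys 'version' and 'audit_entry' (assumed layout)
    audit_entries -- audit entry identifiers of the file, oldest first
    """
    last = revisions[-1] if revisions else None

    # Walk the audit trail from the youngest entry backwards and collect
    # everything that is younger than the last known revision entry.
    pending = []
    for entry in reversed(audit_entries):
        if last is not None and entry == last["audit_entry"]:
            break
        pending.append(entry)

    # Create the new revision entries oldest first, numbered continuously.
    next_version = last["version"] + 1 if last is not None else 1
    created = []
    for entry in reversed(pending):
        created.append({"version": next_version, "audit_entry": entry})
        next_version += 1
    return created
```

If the youngest audit entry already matches the last revision entry, the function returns an empty list, which corresponds to the case in which no further action is taken.
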
If the ADF file can be modified, these changes are written directly to the AuditTrail model. Otherwise they are written only to an in-memory model and discarded after use. This way the revision feature can be used even for older files that must not be changed, only more slowly, since the AuditTrail has to be analysed every time.
All revision entries are connected to the file's Resource in the Audit model by the property pav:hasVersion. They also form a doubly linked list, with pav:nextVersion pointing to the next revision and pav:previousVersion pointing to the previous revision. The revision node itself has a UUID-based IRI.
The DpFile's Resource in the Audit model needs to have an rdf:type entry for prov:Entity and needs to be linked to the DataPackage node adf://dp by dct:hasPart, ore:aggregates and prov:hadMember. It also has an entry for pav:version with the current revision number as its value and points to the end of the doubly linked revision list with pav:previousVersion.
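
The linking structure described above can be pictured with a small sketch that emits the statements as plain (subject, predicate, object) tuples. It assumes that the UUID-based IRI of a revision node is a urn:uuid: URN, and it leaves out the pav:version statement of the file itself as well as the links to adf://dp. The function name link_revisions is hypothetical.

```python
import uuid


def link_revisions(file_iri, count):
    """Build the statements for `count` revision nodes of one file.

    Returns the revision node IRIs (oldest first) and the list of triples.
    """
    nodes = [f"urn:uuid:{uuid.uuid4()}" for _ in range(count)]
    triples = []
    for index, node in enumerate(nodes):
        # Every revision entry is attached to the file's resource.
        triples.append((file_iri, "pav:hasVersion", node))
        # Revision numbers start with 1 at the oldest revision.
        triples.append((node, "pav:version", index + 1))
        # Doubly linked list between neighbouring revisions.
        if index > 0:
            triples.append((node, "pav:previousVersion", nodes[index - 1]))
        if index + 1 < count:
            triples.append((node, "pav:nextVersion", nodes[index + 1]))
    if nodes:
        # The file's resource points to the end of the revision list.
        triples.append((file_iri, "pav:previousVersion", nodes[-1]))
    return nodes, triples
```
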
Values that should be present at each revision entry node:

- pav:version - Revision number, starting with 1 at the oldest revision
- adf-audit:hasAuditRecord - AuditRecord describing the change from this version to the next
- adf-dp:fileSize - File size in this revision
- adf-dp:modifiedBy - Modifying agent in this revision
- dct:modified - Modification time of this revision
- http://www.loc.gov/premis/rdf/v1#hasMessageDigest - The message digest of this revision

If values like adf-dp:modifiedBy or http://www.loc.gov/premis/rdf/v1#hasMessageDigest are not changed in an AuditRecord, their values are omitted in that record. This also means they cannot be copied for the revision history. When a change of the value finally happens, however, the old value is noted in the AuditRecord and older revision entries (those missing the value) can be updated. If no change of the value has happened yet, its correct value can be found in the current data and be used to update the revision history. This check and update happens after every write of a revision entry.
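
The back-filling rule above can be sketched as a single pass from the youngest revision entry to the oldest. The sketch assumes that revision entries are dictionaries ordered oldest first, that a missing value is represented by an absent key or None, and that a value noted at a change is stored on the revision entry preceding that change. The function name backfill is hypothetical.

```python
def backfill(revisions, key, current_value):
    """Fill missing values of `key` in place and return the list.

    revisions     -- revision entries, oldest first (assumed layout)
    key           -- property to fill, e.g. 'adf-dp:modifiedBy'
    current_value -- value found in the current data of the file
    """
    # Until an older value is found, the current value applies.
    known = current_value
    for entry in reversed(revisions):
        if entry.get(key) is None:
            entry[key] = known
        else:
            # A recorded value also applies to all older entries that
            # are missing it, up to the next recorded value.
            known = entry[key]
    return revisions
```
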
If the DpFile has been deleted or recreated (delete and create in one AuditRecord), the HDF5 dataset representing the file in its old state will have been moved to an AuditTrail archive folder, so the revisions are no longer represented by the current state of the file. This is addressed by adding an adf-dp:representedBy entry pointing to the adf-audit:archivedTo value of the AuditRecord's removal entry. This needs to be done for the previous revision entries, going back until the first one that already has an adf-dp:representedBy entry.
This section describes the operations that can be carried out with the Data Package API. First, the operations on folders are described, then the operations on files.
The terminology that is used below is defined in [[!ADF-DPO]].
The Data Package API MUST provide a method that offers access to the Data Package contained in a given ADF file. The method takes as argument the ADF file and returns a reference to the Data Package contained in the file.
The Data Package API MUST provide a method to create a folder in another folder of the Data Package. The method takes the name of the new folder as argument and returns a reference to the new folder.
The method MUST throw an exception, if a folder or file with this name already exists in the target folder.
The name of the folder MUST fulfil the interoperability requirements (e.g. avoid invalid characters, see subsection above). If not, the method MUST throw an exception.
The Data Package MUST contain the new Folder, which MUST have as parent the target Folder.
For the folder, an HDF5 group MUST have been created in the HDF5 group representing the target folder. HDF5 named objects like Groups or Datasets MUST use the identifier part of the UUID URN of the resource as specified in [[!rfc4122]]. The name of the new HDF5 group MUST be the UUID part of <folder-URI>, where folder-URI is a UUID URN using a type 4 (pseudo randomly generated) UUID.
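
As a minimal sketch of the naming rule above, the following Python code generates a type 4 UUID URN and derives the HDF5 object name from it. Only the standard uuid module is used; the function names new_resource_uri and hdf5_name are hypothetical and not part of the ADF API.

```python
import uuid

URN_PREFIX = "urn:uuid:"


def new_resource_uri() -> str:
    """Generate a UUID URN based on a type 4 (random) UUID."""
    return uuid.uuid4().urn


def hdf5_name(resource_uri: str) -> str:
    """Return the identifier part of a UUID URN for use as an HDF5 object name."""
    if not resource_uri.lower().startswith(URN_PREFIX):
        raise ValueError(f"not a UUID URN: {resource_uri!r}")
    identifier = resource_uri[len(URN_PREFIX):]
    # uuid.UUID raises ValueError if the identifier is not a valid UUID.
    return str(uuid.UUID(identifier))
```
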
Meta data of the folder that fulfil the following requirements MUST have been added to the ADF-QS via ADF-DD:

- dct:identifier, type xsd:string
- dct:title, type xsd:string
- dct:created, type xsd:dateTime
- dct:creator, type dct:Agent. If it does not exist, an instance of foaf:Person with dct:identifier <userid> is created.
- dct:modified, type xsd:dateTime
- adf-dp:modifiedBy, type dct:Agent
- dct:isPartOf, type adf-dp:Folder. The only exception here is the root folder. From the parent folder, the inverse relations dct:hasPart and ldp:contains are set, referencing the newly created folder. The values of dct:modified and adf-dp:modifiedBy of the parent folder are updated to hold the same values as the newly created folder.
- adf-dp:representedBy, type hdf:Group. The URI is built from the absolute path of the folder within the HDF5 file.

The method has to be transactional.
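
To make the folder meta data requirements above more concrete, the following sketch builds the meta data of a new folder as a Python dictionary and updates the parent folder accordingly. It is a simplified in-memory model under several assumptions: agents are represented by a plain user id, dct:identifier is taken to be the UUID part of the folder URI, and adf-dp:representedBy as well as transaction handling are left out. The function name create_folder_metadata is hypothetical.

```python
import uuid
from datetime import datetime, timezone


def create_folder_metadata(title, user_id, parent):
    """Return the meta data of a new folder and update `parent` in place."""
    now = datetime.now(timezone.utc).isoformat()
    identifier = str(uuid.uuid4())
    folder = {
        "uri": f"urn:uuid:{identifier}",
        "dct:identifier": identifier,  # assumption: UUID part of the URI
        "dct:title": title,
        "dct:created": now,
        "dct:creator": user_id,
        "dct:modified": now,
        "adf-dp:modifiedBy": user_id,
        "dct:isPartOf": parent["uri"],
        "dct:hasPart": [],
        "ldp:contains": [],
    }
    # Inverse relations from the parent folder to the new folder.
    parent["dct:hasPart"].append(folder["uri"])
    parent["ldp:contains"].append(folder["uri"])
    # The parent carries the same modification data as the new folder.
    parent["dct:modified"] = now
    parent["adf-dp:modifiedBy"] = user_id
    return folder
```
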
When creating the root folder, a UUID URN URI MUST be generated (folder-URI) in the ADF-QS and the class of the folder stated (adf-dp:Folder and ldp:Container). The only difference in the requirements when creating a root folder is that a "parent folder" MUST NOT be specified.
The ADF-DP API MUST provide a method to access a given folder in a Data Package. The folder MAY be given by URI, by absolute path or by relative path. The method takes the name or the URI of the folder as argument and returns a reference to this folder, if it exists.
The method MUST throw an exception, if the folder given by URI or by path does not exist.
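
The three ways of addressing a folder can be sketched with a small in-memory tree. This is an illustrative model only: the Folder class, the get_folder function, the use of "/" as path separator and the choice of LookupError as exception type are assumptions made for this example.

```python
class Folder:
    """Minimal in-memory folder node (hypothetical model)."""

    def __init__(self, name, uri, parent=None):
        self.name = name
        self.uri = uri
        self.parent = parent
        self.children = {}
        if parent is not None:
            parent.children[name] = self


def _walk(folder):
    """Yield the folder and all of its descendants."""
    yield folder
    for child in folder.children.values():
        yield from _walk(child)


def get_folder(root, current, reference):
    """Resolve a folder given by URI, absolute path or relative path."""
    if reference.startswith("urn:uuid:"):
        for folder in _walk(root):
            if folder.uri == reference:
                return folder
        raise LookupError(f"no folder with URI {reference}")

    # Absolute paths start at the root, relative paths at `current`.
    folder = root if reference.startswith("/") else current
    for part in reference.split("/"):
        if part in ("", "."):
            continue
        folder = folder.parent if part == ".." else folder.children.get(part)
        if folder is None:
            raise LookupError(f"no folder at path {reference}")
    return folder
```
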
The folder returned by this method MUST fulfil the following requirements for accessing a folder:

- dct:identifier, type xsd:string
- dct:title, type xsd:string
- dct:created, type xsd:dateTime
- dct:creator, type dct:Agent
- dct:modified, type xsd:dateTime
- adf-dp:modifiedBy, type dct:Agent
- dct:isPartOf, type adf-dp:Folder

The folder returned by this method fulfils the requirements of the method of accessing a folder described above.
When updating a folder, the meta data of this folder MUST fulfil the following requirements:

- dct:identifier, type xsd:string
- dct:title, type xsd:string
- dct:created, type xsd:dateTime. MUST NOT be changed.
- dct:creator, type dct:Agent. MUST NOT be changed.
- dct:modified, type xsd:dateTime
- adf-dp:modifiedBy, type dct:Agent
- dct:isPartOf and ldp:member, type adf-dp:Folder. From the parent folder, the inverse relations dct:hasPart and ldp:contains MUST be set, referencing the updated folder. The values of dct:modified and adf-dp:modifiedBy of the parent folder are updated to hold the same values as the updated folder.
- adf-dp:representedBy, type hdf:Group. The URI is built from the absolute path of the folder within the HDF5 file.

The ADF-DP API SHOULD provide a method to list the contents of a folder in a given Data Package. The method MUST return an empty list if the folder does not have any contents.
The metadata of files and folders to load is based on the requirements for accessing files and accessing folders.
The ADF Data Package API MUST provide a method to remove an empty folder within a Data Package. If the given folder does not exist in the Data Package, the method MUST throw an exception. If the folder is not empty, the method MUST throw an exception. If the folder is referenced by other predicates in ADF-QS, it MUST NOT be removed and the method MUST throw an exception.
Here "removal" means that the folder is removed from the Data Package; it does not need to be removed physically in HDF5. Furthermore, the meta data of the folder are removed from the ADF-QS.
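
The three failure conditions for removing a folder can be sketched as follows. The model is deliberately simple and hypothetical: folders are kept in a dictionary that maps a folder URI to the list of its children, and references from other statements in the ADF-QS are given as a set of URIs. The names FolderRemovalError and remove_folder are assumptions of this example.

```python
class FolderRemovalError(Exception):
    """Raised when a folder cannot be removed (hypothetical exception type)."""


def remove_folder(folders, referenced, uri):
    """Remove an empty, unreferenced folder from the in-memory model.

    folders    -- dict mapping folder URI to the list of child URIs
    referenced -- set of URIs referenced by other statements
    """
    if uri not in folders:
        raise FolderRemovalError(f"folder does not exist: {uri}")
    if folders[uri]:
        raise FolderRemovalError(f"folder is not empty: {uri}")
    if uri in referenced:
        raise FolderRemovalError(f"folder is still referenced: {uri}")
    # Remove the folder's own entry and the entry in its parent.
    del folders[uri]
    for children in folders.values():
        if uri in children:
            children.remove(uri)
```
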
When creating a file, a UUID URN URI MUST be generated (file-URI) and the class of the file stated (adf-dp:File and ldp:Resource) in the ADF-QS.
In addition, the following requirements for the class adf-dp:File MUST be fulfilled at creation time:

- dct:identifier, type xsd:string
- dct:title, type xsd:string
- dct:created, type xsd:dateTime
- dct:creator, type dct:Agent
- dct:modified, type xsd:dateTime
- adf-dp:modifiedBy, type dct:Agent
- dct:format, value from [[MediaTypes]] (default value is http://purl.org/NET/mediatypes/application/octet-stream)
- adf-dp:fileSize, type xsd:long
- adf-dp:charset, type xsd:string. The character set (encoding) is required if the file format is text.
- adf-dp:lineSeparator, type xsd:string. The line separator is required if the file format is text.
- dct:isPartOf and ldp:member, type adf-dp:Folder
- adf-dp:representedBy, type hdf:Dataset

The ADF-DP API MUST provide a method to access a given file in a Data Package. The file MAY be given by URI, by absolute path or by relative path. The method takes the name or the URI of the file as argument and returns a reference to this file, if it exists.
The method MUST throw an exception, if the file given by URI or by path does not exist.
The file returned by this method MUST fulfil the following requirements:

- dct:identifier, type xsd:string
- dct:title, type xsd:string
- dct:created, type xsd:dateTime
- dct:creator, type dct:Agent
- dct:modified, type xsd:dateTime
- adf-dp:modifiedBy, type dct:Agent
- dct:format, value from [[MediaTypes]] (default value is http://purl.org/NET/mediatypes/application/octet-stream)
- adf-dp:fileSize, type xsd:long
- adf-dp:charset, type xsd:string. The character set (encoding) is required if the file format is text.
- adf-dp:lineSeparator, type xsd:string. The line separator is required if the file format is text.
- dct:isPartOf and ldp:member, type adf-dp:Folder

When updating a file, the following meta data of the class adf-dp:File MUST be fulfilled at modification time and updated in the ADF-QS:
- dct:identifier, type xsd:string. The identifier MUST NOT be changed (it MUST be exactly the UUID part of the URI).
- dct:title, type xsd:string
- dct:created, type xsd:dateTime. MUST NOT be changed.
- dct:creator, type dct:Agent. MUST NOT be changed.
- dct:modified, type xsd:dateTime. MUST be set to the current system time.
- adf-dp:modifiedBy, type dct:Agent. The user MUST be set to the current system user; the agent SHOULD be an instance of foaf:Person with dct:identifier = <username>.
- dct:format, value from [[MediaTypes]] (default value is http://purl.org/NET/mediatypes/application/octet-stream)
- adf-dp:fileSize, type xsd:long. MAY need to be changed.
- adf-dp:charset, type xsd:string. The character set (encoding) is required if the file format is text.
- adf-dp:lineSeparator, type xsd:string. The line separator is required if the file format is text.
- dct:isPartOf and ldp:member, type adf-dp:Folder. The requirements for updating a folder MUST be fulfilled at modification time for the parent folder.

The ADF-DP API MUST provide a method to write bytes into a given File within a Data Package, using a streaming API.
The method MUST support the following use cases:
The method MUST support variable length files. The chunk size MUST be configurable and preset to a reasonable value. When using a chunked HDF5 dataset and exceeding the size of the current chunk, a new chunk MUST be allocated and used implicitly. The method MAY support fixed length files in a later release of ADF, e.g. for performance optimization of read-only files (using HDF5 opaque datatype).
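
The implicit allocation of chunks can be illustrated without HDF5 by a small in-memory model. This sketch is not the HDF5 mechanism itself: it only shows how a variable length file grows chunk by chunk when a write exceeds the current chunk. The class name ChunkedFile and the default chunk size of 4096 bytes are assumptions of this example.

```python
class ChunkedFile:
    """Variable length byte store that grows in fixed-size chunks."""

    def __init__(self, chunk_size=4096):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size
        self.chunks = []  # list of bytearray, each chunk_size bytes long
        self.size = 0     # number of bytes actually written

    def write(self, data):
        """Append bytes, allocating new chunks implicitly when needed."""
        view = memoryview(data)
        while len(view):
            offset = self.size % self.chunk_size
            if offset == 0:
                # The current chunk is full (or none exists yet).
                self.chunks.append(bytearray(self.chunk_size))
            count = min(self.chunk_size - offset, len(view))
            self.chunks[-1][offset:offset + count] = view[:count]
            self.size += count
            view = view[count:]

    def read(self):
        """Return all bytes written so far."""
        return b"".join(self.chunks)[:self.size]
```
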
In case of CREATE, a variable length HDF5 dataset using the HDF5 VLEN type based on byte MUST have been created, representing the new File. The bytes MUST have been written into this HDF5 dataset. Meta data of the file MUST have been added to the ADF-QS via ADF-DD as defined in Creating a File.
In the non-exceptional case of CREATE_NEW, the bytes MUST have been written into the existing HDF5 dataset representing the file, starting at the first index position. Meta data of the file MUST have been updated in the ADF-QS via ADF-DD according to the requirements for updating a file.
In case of TRUNCATE_EXISTING, the bytes MUST have been written into the existing HDF5 dataset representing the file, starting at the first index position. Meta data of the file MUST have been updated in the ADF-QS via ADF-DD according to the requirements for updating a file.
In case of APPEND, the bytes MUST have been written into the existing HDF5 dataset representing the file, starting at the last index position. Meta data of the file MUST have been updated in the ADF-QS via ADF-DD according to the requirements for updating a file.
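
The effect of the write modes on the stored bytes can be sketched with an in-memory store. The sketch covers CREATE, TRUNCATE_EXISTING and APPEND only; CREATE_NEW is left out because its exceptional case is not spelled out in this section. The function name write_bytes and the dictionary-based store are hypothetical, and the returned length merely stands in for the adf-dp:fileSize value that has to be updated.

```python
def write_bytes(store, file_id, data, mode):
    """Write `data` to the file `file_id` in `store` and return the new size.

    store -- dict mapping a file identifier to a bytearray (assumed model)
    mode  -- 'CREATE', 'TRUNCATE_EXISTING' or 'APPEND'
    """
    if mode == "CREATE":
        # This sketch assumes that the file does not exist yet.
        store[file_id] = bytearray(data)
    elif mode == "TRUNCATE_EXISTING":
        # Replace the content, starting at the first index position.
        store[file_id][:] = data
    elif mode == "APPEND":
        # Continue at the last index position.
        store[file_id].extend(data)
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return len(store[file_id])
```

In this model, TRUNCATE_EXISTING and APPEND raise a KeyError if the file does not exist.
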
The method MUST set (in case of create) or adjust (in case of update) the inherent meta data of the file (e.g. file size). The implementation MUST add RDF statements to ADF-QS via ADF-DD according to the requirements for creating a file or the requirements for updating a file, respectively.

The ADF Data Package API MUST provide a method for the removal of files. Here "removal" means that the file is marked as removed but not physically removed; this holds for both its HDF5 representation and its RDF representation.
In order to be able to remove a file, it MUST be ensured that it is not referenced by a resource in ADF-QS.
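
In contrast to folders, a removed file stays in place and is only marked. A minimal sketch of this behaviour, assuming a dictionary of file records and a set of URIs that are referenced by resources in the ADF-QS, could look as follows. The function name remove_file, the key removed and the exception types are hypothetical.

```python
def remove_file(files, referenced, uri):
    """Mark the file `uri` as removed without deleting its record.

    files      -- dict mapping file URI to a dict of file properties
    referenced -- set of file URIs referenced by other resources
    """
    if uri not in files:
        raise KeyError(uri)
    if uri in referenced:
        raise ValueError(f"file is still referenced: {uri}")
    # The record and its bytes stay in place; only a marker is set.
    files[uri]["removed"] = True
```
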
The ADF-DP API MUST provide methods to import files and folders into ADF and methods to export these files and folders.
The import service creates ADF-DpFiles and ADF-DpFolders according to the given source and MUST preserve the hierarchy of the source. Additionally, the following retrieval and source meta-data are stored:
- pav:retrievedBy, type dct:Agent
- pav:retrievedOn, type xsd:dateTime
- pav:retrievedFrom
- dct:title, type xsd:string
- pav:createdBy, type dct:Agent
- pav:createdOn, type xsd:dateTime
- dct:modified, type xsd:dateTime
- adf-dp:URL, type xsd:anyURI
- adf-dp:path
- adf-dp:hostname, type xsd:string
- adf-dp:path, type xsd:string
- adf-dp:path, type dct:identifier

The method MUST throw an exception if either the source path or the target ADF-DpFolder does not exist.
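
How an import can preserve the hierarchy of the source may be sketched with the Python standard library. The sketch only plans the import: it walks a source directory and returns the relative folder and file paths that would have to be created in the Data Package, together with the file sizes. It assumes that the source is a directory on the local file system; the function name plan_import and the returned structure are hypothetical.

```python
import os


def plan_import(source):
    """Return (folders, files) to create for the directory `source`.

    folders -- relative folder paths, parents before children
    files   -- dicts with the relative 'path' and the 'size' in bytes
    """
    if not os.path.isdir(source):
        raise FileNotFoundError(source)
    folders, files = [], []
    for dirpath, dirnames, filenames in os.walk(source):
        dirnames.sort()  # deterministic traversal order
        relative = os.path.relpath(dirpath, source)
        relative = "" if relative == "." else relative.replace(os.sep, "/")
        if relative:
            folders.append(relative)
        for name in sorted(filenames):
            files.append({
                "path": f"{relative}/{name}" if relative else name,
                "size": os.path.getsize(os.path.join(dirpath, name)),
            })
    return folders, files
```

Because os.walk visits directories top-down, every folder appears in the result before the folders and files it contains.
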
The export service restores the files and folders of the original source to the specified target. Symbolic links are replaced by their target.
The method MUST throw an exception if the source ADF-DpFile or ADF-DpFolder, or the target path, does not exist.
Version | Release Date | Remarks |
---|---|---|
0.3.0 | 2015-04-30 | |
0.4.0 | 2015-06-29 | |
1.0.0 RC | 2015-09-17 | |
1.0.0 | 2015-09-29 | |
1.1.0 RC | 2016-03-11 | |
1.1.0 RF | 2016-03-31 | |
1.1.5 | 2016-05-13 | |
1.2.0 Preview | 2016-09-23 | |
1.2.0 RC | 2016-12-07 | |
1.3.0 Preview | 2017-03-31 | |
1.3.0 RF | 2017-06-30 | |
1.4.3 RC | 2018-10-11 | |
1.4.5 RF | 2018-12-17 | |
1.5.0 RC | 2019-12-12 | |
1.5.0 RF | 2020-03-24 | |
1.5.3 RF | 2020-11-30 | |